CubeSVD: A Novel Approach to Personalized Web Search


Jian-Tao Sun 
Dept. of Computer Science 
TsingHua University 
Beijing 100084, China 


sjt@mails.tsinghua.edu.cn 


Yuchang Lu 
Dept. of Computer Science 
TsingHua University 
Beijing 100084, China 


lyc@tsinghua.edu.cn 


ABSTRACT 


As competition in the Web search market increases, there
is a high demand for personalized Web search that conducts
retrieval incorporating Web users’ information needs. This
paper focuses on utilizing clickthrough data to improve Web
search. Since millions of searches are conducted every day,
a search engine accumulates a large volume of clickthrough
data, which records who submits queries and which pages
he/she clicks on. The clickthrough data is highly sparse and
contains different types of objects (user, query and Web
page), and the relationships among these objects are also
very complicated. By performing analysis on these data, we
attempt to discover Web users’ interests and the patterns
by which users locate information. In this paper, a novel
approach, CubeSVD, is proposed to improve Web search. The
clickthrough data is represented by a 3-order tensor, on 
which we perform 3-mode analysis using the higher-order 
singular value decomposition technique to automatically cap- 
ture the latent factors that govern the relations among these 
multi-type objects: users, queries and Web pages. A tensor 
reconstructed based on the CubeSVD analysis reflects both 
the observed interactions among these objects and the im- 
plicit associations among them. Therefore, Web search ac- 
tivities can be carried out based on CubeSVD analysis. Ex- 
perimental evaluations using a real-world data set collected 
from an MSN search engine show that CubeSVD achieves 
encouraging search results in comparison with some stan- 
dard methods. 


Categories and Subject Descriptors 


H.3.3 [Information Storage and Retrieval]: Information 
Search and Retrieval-Search Process; H.3.5 [Information 
Storage and Retrieval]: Online Information Services-Web 
based services 


1. INTRODUCTION


The increase of WWW resources has fueled the demand
for effective and efficient information retrieval. Millions of
searches are conducted every day on search engines such
as Yahoo!, Google and MSN. Despite their popularity,
search engines have deficiencies: given a query, they
usually return a huge list of results, and the pages ranked at
the top may not meet users’ needs. One reason for this problem
is the keyword-based query interface, which makes it difficult
for users to describe exactly what they need. Besides, typical
search engines often do not exploit user information. Even
if two users submit the same query, their information needs
may be different [4, 16]. For example, if the query “jaguar” is
issued to Google, 11,900,000 results are returned. Regardless
of who submits the query, both the pages returned and
the rank orders are identical. Since “jaguar” may refer to
“jaguar car” or “jaguar cats”, two users with different
interests may want the search results ranked differently: a car
fan may expect car-related pages to be ranked highly, whereas
a zoologist may have no use for these pages. Thus the search
results should be adapted according to the person who submits
the query and which query he/she submits.

Personalized Web search carries out retrieval for each
user incorporating his/her own information need. As the
competition in the search market increases, some search
engines have offered personalized search services. For
example, Google’s Personalized Search allows users to specify
the Web page categories of interest [1]. Some Web search
systems use relevance feedback to refine user needs or ask
users to register their demographic information beforehand
in order to provide better service [2, 8]. Since these systems
require users to engage in additional activities beyond search
to specify/modify their preferences manually, approaches
that are able to implicitly capture users’ information needs
should be developed.

This paper focuses on utilizing clickthrough data to im- 
prove Web search. Consider the typical search scenario: a 
user submits a query to a search engine, the search engine re- 
turns a list of ranked Web pages, then the user clicks on the 
pages of interest. After a period of usage, the server side will 
accumulate a collection of clickthrough data, which records 
the search history of Web users. The data objects contained
in the clickthrough data are of different types: user, query
and Web page. Furthermore, relationships among these
objects are complicated [25]. For example, users with similar
information needs may visit pages on similar topics even if they
submit different queries; users with dissimilar needs may
visit different pages even if they submit the same query, as
the “jaguar” example indicates. It can be assumed that the
clickthrough data reflect Web users’ interests and
contain patterns in how users find their information [13, 14].
By performing analysis on the clickthrough data, we attempt 
to discover the latent factors that govern the associations 
among these multi-type objects. Consequently, Web pages 
can be recommended according to the associations captured. 

Here we clarify some characteristics specific to personal- 
ized Web search based on clickthrough data analysis. This 
task is related to recommender systems which have been 
extensively studied [3, 6, 11, 21]. While most recommenda- 
tion algorithms like Collaborative Filtering (CF) are applied 
to two-way data containing user preferences over items, the 
clickthrough data analysis deals with three-way data. As far 
as we know, previous literature on recommendation contains 
few studies on data of this kind. The three-way clickthrough 
data imposes at least two challenges: 

1) The relations among user, query and Web page are 
complicated. There exist intra-relations among objects of 
the same type, as well as inter-relations among objects of dif- 
ferent type [25]. For personalized Web search tasks, what we 
are concerned about are the 3-order relations among them. 
That is, given a user and a query issued by the user, the 
purpose is to predict whether and how much the user is in- 
terested in a Web page. Therefore, a unified framework is 
needed to model the multi-type objects and the multi-type 
relations among them. 

2) The three-way data are highly sparse. As we know, 
most CF algorithms are susceptible to data sparsity [21, 
3]. For clickthrough data, the sparseness problem becomes 
more serious because each user only submits a small number 
of queries, and only a very small set of Web pages are visited 
by each user. Latent Semantic Indexing (LSI) [7] has
proved useful for addressing the data sparseness problem in
two-way data recommender systems [20, 21]; however, it is still
an open problem for the three-way data case.

In order to address the problems mentioned above, we 
need an approach dealing with the clickthrough data which 
is three-way and highly sparse. In this paper, we develop 
a unified framework to model the three types of objects: 
user, query and Web page. The clickthrough data is rep- 
resented by a 3-order tensor, on which 3-mode analysis is 
performed using the Higher-Order Singular Value Decom- 
position (HOSVD) technique [15]. Because our tensor rep- 
resentation is 3-dimensional and our approach is a multilin- 
ear extension of the matrix Singular Value Decomposition 
(SVD), we name it CubeSVD. 

The remainder of this paper is organized as follows.
Section 2 provides related work. Section 3 gives a brief
introduction to SVD and HOSVD techniques. Section 4 describes
our proposed CubeSVD algorithm. Section 5 presents the 
experimental results and Section 6 offers some concluding 
remarks and directions for future research. 


2. RELATED WORK 


In this section we briefly present some of the research
literature related to personalized Web search, recommender
systems, SVD for recommendation, mining techniques for
clickthrough data, and Higher-Order Singular Value
Decomposition (HOSVD).

Some previous personalized search techniques, e.g., [2, 16, 
19], are mostly based on user profiling. Generally, user pro- 
files are created by asking users to fill out registration forms 
or to specify the Web page categories of their interests [1]. 
Users have to modify their preferences by themselves if their 
interests change. There are also some works on automatic 
creation of user preferences. In [23], user profiles were up- 
dated by accumulating their preferences reflected in the past 
browsing history. In [16], the user profile was represented 
by a hierarchical category tree and the corresponding key- 
words associated with each category. The user profile was 
automatically learned from the user’s search history. 

Many current Web search engines focus on hyperlink struc- 
tures of the Web. For example, Google calculated a univer- 
sal PageRank vector which reflects the relative importance 
of each page. Personalized PageRank, which is a modifica- 
tion of global PageRank, was first proposed for personalized 
Web search in [18]. In [10], “topic sensitive” PageRank was 
proposed to improve personalized Web search. The authors 
proposed to compute a set of PageRank vectors which cap- 
ture the page importance with respect to a particular topic. 
Since no user’s context information is used in this approach, 
it is difficult to evaluate whether the results achieved satisfy 
a user’s information need. 

Besides search engines, many recommender systems have 
been developed which recommend movies, music, Web pages, 
etc. Most recommender systems analyze a matrix contain- 
ing user preferences over items. Among the algorithms used, 
Collaborative Filtering (CF) is a group of popular methods
[6, 11]. The philosophy behind CF is to recommend items
based on preferences of similar users. That is, if a group of
users share similar interests, the items preferred by one user
can be recommended to others of the group. Since neighborhood
formation requires sufficient amounts of training data,
CF is sensitive to data sparsity [21, 3]. In order to address
this issue, Latent Semantic Indexing (LSI) was applied to
recommender systems and promising results were achieved
[20, 21]. LSI is based on truncated singular value
decomposition and has also been successfully used in the
information retrieval (IR) community [7]. In [21], the authors
use LSI for two recommendation tasks: to predict the likelihood
that a product is preferred by a customer, and to generate a list
of top-N recommendations. LSI was also studied in [22] for
collaborative filtering applications.

Web usage mining techniques have achieved great success
in various application areas [13, 14, 17]. As far as we know,
there have been few works on incorporating three-way
clickthrough data for personalized Web search. An exception is
[3], which extended Hofmann’s aspect model to incorporate
three-way co-occurrence data for the recommendation problem.
However, it was not used for Web search applications. The
technique introduced in [14] uses clickthrough data in
order to improve the quality of Web search. The author uses
the relative preferences between Web pages to learn the
retrieval functions. In [25], the authors also examine the
interrelated data objects of clickthrough data and put
forward a reinforcement clustering algorithm to cluster these
multi-type objects.


Figure 1: Visualization of matrix SVD

The higher-order singular value decomposition technique 
was proposed in [15]. It is a generalization of singular value 
decomposition and has been successfully applied for com- 
puter vision problems in [24]. We propose to use the HOSVD 
technique for personalized Web search in this paper. 


3. SVD AND HOSVD 


Since our CubeSVD approach is based on HOSVD tech- 
nique, which is a generalization of matrix SVD, we first 
briefly review matrix SVD and then introduce tensor and 
the HOSVD technique. In this paper, tensors are denoted by
calligraphic upper-case letters ($\mathcal{A}, \mathcal{B}, \ldots$),
matrices by upper-case letters ($A, B, \ldots$), scalars by
lower-case letters ($a, b, \ldots$), and vectors by bold
lower-case letters ($\mathbf{a}, \mathbf{b}, \ldots$).


3.1 Matrix SVD 


The SVD of a matrix is visualized in Figure 1. An
$I_1 \times I_2$ matrix $F$ can be written as the product:

$$F = U^{(1)} S \, U^{(2)T} \qquad (1)$$

where $U^{(1)} = (u^{(1)}_1 \; u^{(1)}_2 \cdots u^{(1)}_{I_1})$ and
$U^{(2)} = (u^{(2)}_1 \; u^{(2)}_2 \cdots u^{(2)}_{I_2})$
are the matrices of the left and right singular vectors. The
column vectors $u^{(1)}_i$, $1 \le i \le I_1$, and $u^{(2)}_j$,
$1 \le j \le I_2$, are orthogonal.
$S = \mathrm{diag}(\sigma_1, \sigma_2, \cdots, \sigma_{\min(I_1, I_2)})$
is the diagonal matrix of singular values, which satisfy
$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(I_1, I_2)} \ge 0$.
By setting the smallest $(\min\{I_1, I_2\} - k)$
singular values in $S$ to zero, the matrix $F$ is approximated
with a rank-k matrix, and this approximation is the best
rank-k approximation in terms of reconstruction error.
Theoretical details on matrix SVD can be found in [9].
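
The truncation described above can be sketched in a few lines of NumPy. This is only an illustration on arbitrary random data; the function name `truncated_svd` and the test matrix are assumptions made for the example, not part of the paper.

```python
import numpy as np

def truncated_svd(F, k):
    """Rank-k approximation of F obtained by zeroing small singular values."""
    # F = U1 @ diag(s) @ U2t, with s sorted in decreasing order (Equation 1).
    U1, s, U2t = np.linalg.svd(F, full_matrices=False)
    s_k = s.copy()
    s_k[k:] = 0.0                      # drop the smallest min(I1, I2) - k values
    return (U1 * s_k) @ U2t, s         # U1 * s_k scales column j of U1 by s_k[j]

rng = np.random.default_rng(0)         # arbitrary illustrative data
F = rng.random((5, 4))
F_k, s = truncated_svd(F, k=2)
err = np.linalg.norm(F - F_k)          # Frobenius norm of the residual
```

In this sketch the residual `err` equals the square root of the sum of the squared discarded singular values, which is the sense in which the rank-k approximation is optimal.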


3.2 Tensor and HOSVD 


A tensor is a higher order generalization of a vector (first
order tensor) and a matrix (second order tensor). Higher-order
tensors are also called multidimensional matrices or
multi-way arrays. The order of a tensor
$\mathcal{A} \in R^{I_1 \times I_2 \times \cdots \times I_N}$
is $N$. Elements of $\mathcal{A}$ are denoted as
$a_{i_1 \cdots i_n \cdots i_N}$, where $1 \le i_n \le I_n$.
In tensor terminology, matrix column vectors are
referred to as mode-1 vectors and row vectors as mode-2
vectors. The mode-n vectors of an N-th order tensor $\mathcal{A}$
are the $I_n$-dimensional vectors obtained from $\mathcal{A}$ by
varying the index $i_n$ and keeping the other indices fixed, that
is, the column vectors of the n-mode matrix unfolding
$A_{(n)} \in R^{I_n \times (I_1 I_2 \cdots I_{n-1} I_{n+1} \cdots I_N)}$
of tensor $\mathcal{A}$. See [15] for details on
matrix unfoldings of a tensor.
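
As a small illustration of matrix unfolding, the sketch below builds the mode-n unfolding of a NumPy array. The helper name `unfold` is an assumption for this example, modes are 0-based in the code (mode 0 is the paper's mode 1), and the column ordering is NumPy's row-major order over the remaining indices, which may differ from the ordering convention of [15].

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: rows indexed by i_n, columns by all remaining indices."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

A = np.arange(24).reshape(2, 3, 4)      # a 3-order tensor with I1=2, I2=3, I3=4
shapes = [unfold(A, n).shape for n in range(A.ndim)]

# A mode-2 vector of the paper (axis 1 here): vary i_2, keep i_1 and i_3 fixed.
mode2_vector = A[0, :, 0]
first_column = unfold(A, 1)[:, 0]       # the same vector, as a column of the unfolding
```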


Figure 2: Visualization of a 3-order Singular Value
Decomposition


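The n-mode product of a tensor
$\mathcal{A} \in R^{I_1 \times I_2 \times \cdots \times I_N}$ by a
matrix $M \in R^{J_n \times I_n}$ is an
$I_1 \times I_2 \times \cdots \times I_{n-1} \times J_n \times I_{n+1} \times \cdots \times I_N$-tensor
of which the entries are given by

$$(\mathcal{A} \times_n M)_{i_1 \cdots i_{n-1} j_n i_{n+1} \cdots i_N} = \sum_{i_n} a_{i_1 \cdots i_{n-1} i_n i_{n+1} \cdots i_N} \, m_{j_n i_n} \qquad (2)$$
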
Note that the n-mode product of a tensor and a matrix is 
a generalization of the product of two matrices. It can be 
expressed in terms of matrix unfolding: 


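$$B_{(n)} = M A_{(n)} \qquad (3)$$

where $B_{(n)}$ is the n-mode unfolding of tensor
$\mathcal{B} = \mathcal{A} \times_n M$.
In terms of n-mode products, the matrix SVD can be
rewritten as $F = S \times_1 U^{(1)} \times_2 U^{(2)}$. By
extension, HOSVD is a generalization of matrix SVD: every
$I_1 \times I_2 \times \cdots \times I_N$
tensor $\mathcal{A}$ can be written as the n-mode product [15]:

$$\mathcal{A} = \mathcal{S} \times_1 V_1 \times_2 V_2 \cdots \times_N V_N \qquad (4)$$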


as illustrated in Figure 2 for $N = 3$. $V_n$ contains the
orthonormal vectors (called n-mode singular vectors) spanning
the column space of the matrix $A_{(n)}$ (the n-mode matrix
unfolding of tensor $\mathcal{A}$). $\mathcal{S}$ is called the
core tensor. Instead of being pseudodiagonal (nonzero elements
only occur when the indices satisfy $i_1 = i_2 = \cdots = i_N$),
$\mathcal{S}$ has the property of all-orthogonality. That is, two
subtensors $\mathcal{S}_{i_n=\alpha}$ and $\mathcal{S}_{i_n=\beta}$
are orthogonal for all possible values of $n$, $\alpha$ and
$\beta$ subject to $\alpha \ne \beta$. At the same time, the
Frobenius-norms $\sigma^{(n)}_i = \|\mathcal{S}_{i_n=i}\|$ are the
n-mode singular values of $\mathcal{A}$ and are in decreasing
order: $\sigma^{(n)}_1 \ge \sigma^{(n)}_2 \ge \cdots \ge \sigma^{(n)}_{I_n} \ge 0$.¹
$\mathcal{S}$ is in general a full tensor and governs the
interactions among the $V_n$.
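
The definitions of this subsection can be checked numerically. The following sketch, on arbitrary random data, implements the n-mode product with `numpy.tensordot`, computes a full HOSVD, and rebuilds the tensor as in Equation (4). The helper names `unfold`, `mode_product` and `hosvd` are assumptions for this example, and modes are 0-based in the code.

```python
import numpy as np

def unfold(tensor, mode):
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """n-mode product (Equation 2): contract the matrix columns with axis `mode`."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))   # new axis comes first
    return np.moveaxis(out, 0, mode)

def hosvd(A):
    """Return the core tensor S and the matrices V_n of n-mode singular vectors."""
    V = [np.linalg.svd(unfold(A, n), full_matrices=False)[0] for n in range(A.ndim)]
    S = A
    for n, Vn in enumerate(V):
        S = mode_product(S, Vn.T, n)      # S = A x_1 V_1^T x_2 V_2^T x_3 V_3^T
    return S, V

rng = np.random.default_rng(1)            # arbitrary illustrative data
A = rng.random((3, 4, 5))
M = rng.random((2, 4))
B = mode_product(A, M, 1)                 # shape (3, 2, 5)

S, V = hosvd(A)
A_rebuilt = S
for n, Vn in enumerate(V):
    A_rebuilt = mode_product(A_rebuilt, Vn, n)           # Equation (4)
```

With these helpers, `unfold(B, 1)` matches `M @ unfold(A, 1)` as in Equation (3), the rows of each unfolding of `S` are mutually orthogonal (all-orthogonality), and their norms are the n-mode singular values.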


4. CUBESVD BASED WEB SEARCH 


When using a search engine to find information, a user (u)
submits a query (q), the search engine returns a list of URLs
and the corresponding descriptions of the target Web pages,
then the user clicks on the pages (p) of interest. After some
time of usage, the search engine accumulates a collection
of clickthrough data, which can be represented by a set of
triplets (u, q, p). From the clickthrough data, we can
construct a 3-order tensor
$\mathcal{A} \in R^{|U| \times |Q| \times |P|}$, where U, Q, P
are the sets of users, queries and pages respectively. Each
element of tensor $\mathcal{A}$ measures the preference of a
(u, q) pair on page p. In the simplest case, the co-occurrence
frequency of u, q and p can be used. In this paper, we also
tried several other approaches to measure the preference.
After tensor $\mathcal{A}$ is constructed, the CubeSVD
algorithm can be applied on it.


¹ The Frobenius-norm of a tensor $\mathcal{A}$ is defined as
$\|\mathcal{A}\| = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}$.
The scalar product $\langle \mathcal{A}, \mathcal{B} \rangle$ of
two tensors $\mathcal{A}, \mathcal{B}$ is defined as
$\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i_1} \sum_{i_2} \cdots \sum_{i_N} a_{i_1 i_2 \cdots i_N} b_{i_1 i_2 \cdots i_N}$.
$\mathcal{S}_{i_n=i}$ is the subtensor of $\mathcal{S}$ obtained
by fixing the nth index of $\mathcal{S}$ to $i$. More details can
be found in [15].


Table 1: Details of the Web Pages Used in the Toy Problem


Page | URL | Title
p1 | http://www.bmw.com | BMW International Website
p2 | http://www.audiusa.com | Audiusa.com Home Page
p3 | http://www.jaguarusa.com/us/en/home.htm | Jaguar Cars
p4 | http://dspace.dial.pipex.com/agarman/bco/ver4.htm | Big Cats Online Home


1. Construct tensor $\mathcal{A}$ from the clickthrough data.
Suppose the numbers of users, queries and Web pages are m, n,
k respectively, then $\mathcal{A} \in R^{m \times n \times k}$.
Each tensor element measures the preference of a
(user, query) pair on a Web page.
2. Calculate the matrix unfoldings $A_u$, $A_q$ and $A_p$ from
tensor $\mathcal{A}$. $A_u$ is calculated by varying the user
index of tensor $\mathcal{A}$ while keeping the query and page
indices fixed. $A_q$ and $A_p$ are computed in a similar way.
Thus $A_u$, $A_q$, $A_p$ are matrices of size
$m \times nk$, $n \times mk$, $k \times mn$ respectively.
3. Compute the SVD of $A_u$, $A_q$ and $A_p$, and set $V_u$,
$V_q$ and $V_p$ to be the left matrix of the respective SVD.
4. Select $m_0 \in [1, m]$, $n_0 \in [1, n]$ and
$k_0 \in [1, k]$. Remove the right-most $m - m_0$, $n - n_0$
and $k - k_0$ columns from $V_u$, $V_q$ and $V_p$, then denote
the reduced left matrices by $W_u$, $W_q$ and $W_p$
respectively. Calculate the core tensor as follows:

$$\mathcal{S} = \mathcal{A} \times_1 W_u^T \times_2 W_q^T \times_3 W_p^T \qquad (5)$$

5. Reconstruct the original tensor by:

$$\hat{\mathcal{A}} = \mathcal{S} \times_1 V_u \times_2 V_q \times_3 V_p \qquad (6)$$


Figure 3: Outline of the CubeSVD algorithm.


4.1 CubeSVD Algorithm 


Our CubeSVD approach is to apply HOSVD on the 3-order
tensor constructed from the clickthrough data. In accordance
with the HOSVD technique introduced in Section
3.2, the CubeSVD algorithm is given in Figure 3. The input
is the clickthrough data and the output is the reconstructed
tensor $\hat{\mathcal{A}}$. $\hat{\mathcal{A}}$ measures the
associations among the users, queries and Web pages. The
elements of $\hat{\mathcal{A}}$ can be represented by a
quadruplet (u, q, p, w), where w measures the likelihood that
user u will visit page p when u submits query q. Therefore,
Web pages can be recommended to u according to their weights
associated with the (u, q) pair.
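
The steps of Figure 3 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation (the paper uses SVDPACKC for the SVD step): the names `cube_svd` and `recommend`, the 0-based indices, and the small set of triples are all assumptions made for the example. The reconstruction uses the truncated matrices $W_u$, $W_q$, $W_p$ so that the shapes agree with the reduced core tensor, which is my reading of step 5.

```python
import numpy as np

def unfold(tensor, mode):
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)

def cube_svd(A, core_dims):
    """Steps 2 to 5 of Figure 3 for a (users x queries x pages) tensor A."""
    W = []
    for mode, keep in enumerate(core_dims):
        left, _, _ = np.linalg.svd(unfold(A, mode), full_matrices=False)  # steps 2, 3
        W.append(left[:, :keep])                                          # step 4
    S = A
    for mode, Wn in enumerate(W):
        S = mode_product(S, Wn.T, mode)          # core tensor, Equation (5)
    A_hat = S
    for mode, Wn in enumerate(W):
        A_hat = mode_product(A_hat, Wn, mode)    # reconstruction, Equation (6)
    return S, A_hat

def recommend(A_hat, user, query, top_n=3):
    """Page indices for a (user, query) pair, ordered by reconstructed weight w."""
    return np.argsort(-A_hat[user, query])[:top_n]

# Hypothetical (user, query, page) triples, used only to exercise the code.
triples = [(0, 0, 0), (1, 0, 0), (1, 1, 1), (2, 1, 2), (2, 2, 2)]
A = np.zeros((3, 3, 3))
for u, q, p in triples:
    A[u, q, p] += 1.0                            # step 1: co-occurrence frequency

S, A_hat = cube_svd(A, core_dims=(2, 3, 3))
ranked_pages = recommend(A_hat, user=0, query=1)
```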


4.2 A Toy Problem Example 


In this subsection, in order to illustrate how our approach
works, we apply the CubeSVD algorithm to a toy problem.
As illustrated in Figure 4, 4 users issued 4 different queries
(“bmw”, “audi”, “jaguar”, “big cat”) and clicked on 4 Web
pages. In Figure 4, the arrow line between a user and a query
indicates that the user issued the corresponding query. The
line between a query and a page indicates that the user clicked
on the page after he/she issued the query. The numbers on the
arrow lines give the correspondence between the three types
of objects. For example, user u1 issued query “bmw” and
then clicked on page p1. The users performed seven clicks
on the 4 pages in this toy problem. The URLs and titles of
the pages visited are given in Table 1. Query “jaguar” may
refer to “jaguar car” or “jaguar cats”. From Table 1, we can
find that p1, p2 and p3 are Web pages on “cars”, while page p4
is related to “cats”. From Figure 4, we can see that users u1
and u2 have common interests in cars, while users u3 and u4
are interested in big cat animals.


Figure 4: Clickthrough data of the toy problem.


A 3-order tensor $\mathcal{A}$ (4 × 4 × 4) can be constructed
from the clickthrough data. For simplicity, we assume there are
no duplicate page visits. That is, if a user issues a query and
then clicks on a Web page, the user only clicks on the page
once. We use the co-occurrence frequency of user, query and
page as the elements of tensor $\mathcal{A}$, which are given in
Table 2. After performing the CubeSVD analysis, we can get the
reconstructed tensor $\hat{\mathcal{A}}$. Table 3 gives the
output of the CubeSVD algorithm, as illustrated in Figure 5. In
Table 3, the rows in italic font (arrow lines 8 to 10) indicate
link relations that do not exist in the original clickthrough
data.

As given in Table 3 and Figure 5, the output of the CubeSVD
algorithm for this toy problem is interesting: new
associations among these objects emerge. From the original
clickthrough data (Figure 4), we can find that neither user
u1 nor u4 issued query q3. There is also no direct indication
of which pages to recommend if either of the two users
submits query q3, because query q3 is ambiguous. According
to the algorithm outputs given in Table 3, the element of
$\hat{\mathcal{A}}$ associated with (u1, q3, p3) is 0.354 and
elements associated with other pages are zero. Thus if u1
issues query q3, then u1 is likely to visit page p3 (arrow
line 9). Similarly, if user u4 submits query q3, then u4 is
likely to visit p4 (arrow line 10). The results are reasonable
since u1 is concerned about cars rather than big cat animals,
while u4 is the opposite. Even though the two users have not
issued query q3, our algorithm can still recommend Web pages
by analyzing the clickthrough data. That is, the CubeSVD
approach is able to capture the latent associations among the
multi-type data objects: user, query and Web page. The
associations can then be used to improve the Web search
accordingly.


Table 2: Tensor Constructed from the Clickthrough
Data of the Toy Problem


Arrow Line | User | Query | Page | Weight
1 | u1 | q1 | p1 | 1
2 | u2 | q1 | p1 | 1
3 | u2 | q2 | p2 | 1
4 | u2 | q3 | p3 | 1
5 | u3 | q3 | p4 | 1
6 | u3 | q4 | p4 | 1
7 | u4 | q4 | p4 | 1


Table 3: Output of CubeSVD Algorithm on the Toy
Problem


Arrow Line | User | Query | Page | Weight
1 | u1 | q1 | p1 | 0.5
2 | u2 | q1 | p1 | 1.207
3 | u2 | q2 | p2 | 0.853
4 | u2 | q3 | p3 | 0.853
5 | u3 | q3 | p4 | 0.723
6 | u3 | q4 | p4 | 1.171
7 | u4 | q4 | p4 | 0.723
8 | u1 | q2 | p2 | 0.354
9 | u1 | q3 | p3 | 0.354
10 | u4 | q3 | p4 | 0.447
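
The toy problem is small enough to recompute directly. The sketch below builds the tensor of Table 2 and applies the truncation with an assumed core size of 2 × 4 × 4; the paper does not state which core size was used for the toy problem, so this choice is a guess that happens to give weights matching Table 3 up to rounding. Indices are 0-based (u1 is 0, q1 is 0, p1 is 0), and the helper `projector` is a name introduced for this example.

```python
import numpy as np

# (user, query, page) triples of Table 2, written 0-based.
clicks = [(0, 0, 0), (1, 0, 0), (1, 1, 1), (1, 2, 2),
          (2, 2, 3), (2, 3, 3), (3, 3, 3)]
A = np.zeros((4, 4, 4))
for u, q, p in clicks:
    A[u, q, p] = 1.0

def projector(A, mode, keep):
    """W W^T for the `keep` leading left singular vectors of a mode unfolding."""
    unfolding = np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)
    W = np.linalg.svd(unfolding, full_matrices=False)[0][:, :keep]
    return W @ W.T

# Assumed core size (2, 4, 4): truncate the user mode only.
Pu, Pq, Pp = (projector(A, mode, keep) for mode, keep in enumerate((2, 4, 4)))

# Equations (5) and (6) combined: A_hat = A x_1 (Wu Wu^T) x_2 (Wq Wq^T) x_3 (Wp Wp^T)
A_hat = np.einsum('ia,jb,kc,abc->ijk', Pu, Pq, Pp, A)

# Links that are absent from the clickthrough data but non-zero after reconstruction.
new_links = {(int(u), int(q), int(p)): round(float(A_hat[u, q, p]), 3)
             for u, q, p in zip(*np.nonzero(np.abs(A_hat) > 1e-9))
             if A[u, q, p] == 0}
# new_links -> {(0, 1, 1): 0.354, (0, 2, 2): 0.354, (3, 2, 3): 0.447}
```

The three new links correspond to arrow lines 8, 9 and 10 of Table 3: (u1, q2, p2), (u1, q3, p3) and (u4, q3, p4).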


4.3 Dimension Selection 


The latent associations among the three types of objects
captured by CubeSVD are stored in the reconstructed tensor
$\hat{\mathcal{A}}$. From step 5 of the CubeSVD algorithm in
Figure 3, we know that tensor $\hat{\mathcal{A}}$ is constructed
by the product of the core tensor $\mathcal{S}$ and the left
matrices $V_u$, $V_q$ and $V_p$, and that the dimensions
of $\mathcal{S}$ are selected in step 4. Since the core tensor
$\mathcal{S}$ governs the interactions among user, query and Web
page objects, the determination of core tensor dimensionality
may play an important role in the result of the algorithm. This
is further verified by our experiments in Section 5.


Figure 5: Illustration of the CubeSVD algorithm
output for the toy problem given in Figure 4.


Recall that in the two-dimensional case [7], LSI computes a low
rank approximation of the original term-by-document matrix
to capture the semantic concepts of a document set. The
resulting matrix is calculated by truncated SVD as Figure 1
indicates. Previous experiments indicate that the number of
singular values kept in the diagonal matrix $S$ is crucial for
LSI’s performance [12], and how to determine the dimension
is still an ongoing research problem.

For the CubeSVD approach, determination of the core
tensor’s dimensions seems more difficult than for LSI. For
LSI, the term-by-document matrix is two dimensional,
thus only one parameter (the number of nonzero singular
values) needs to be decided. For CubeSVD, there are three
dimensional parameters to be determined. According to the
CubeSVD algorithm in Figure 3, the core tensor $\mathcal{S}$ is
calculated from the product of tensor $\mathcal{A}$ by $W_u$,
$W_q$ and $W_p$. Therefore, how many columns of $V_u$, $V_q$
and $V_p$ are kept determines the dimensions of the core tensor
($m_0 \times n_0 \times k_0$). Since the left matrices $V_u$,
$V_q$ and $V_p$ are calculated by solving SVD problems on the
matrix unfoldings $A_u$, $A_q$ and $A_p$ respectively, in this
paper we use an eigenvalue based method to determine the core
tensor dimensions empirically.

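According to the tensor decomposition property [15], by
setting the smallest n-mode singular values
$\sigma^{(u)}_{m_0+1}, \ldots, \sigma^{(u)}_{m}$,
$\sigma^{(q)}_{n_0+1}, \ldots, \sigma^{(q)}_{n}$ and
$\sigma^{(p)}_{k_0+1}, \ldots, \sigma^{(p)}_{k}$ to zero, we
obtain an approximation $\hat{\mathcal{A}}$ of the original
tensor $\mathcal{A}$ that satisfies:

$$\|\mathcal{A} - \hat{\mathcal{A}}\|^2 \le \sum_{i_u=m_0+1}^{m} \left(\sigma^{(u)}_{i_u}\right)^2 + \sum_{i_q=n_0+1}^{n} \left(\sigma^{(q)}_{i_q}\right)^2 + \sum_{i_p=k_0+1}^{k} \left(\sigma^{(p)}_{i_p}\right)^2 \qquad (7)$$

As discussed in [15], if $\sigma^{(u)}_{m_0}$,
$\sigma^{(q)}_{n_0}$ and $\sigma^{(p)}_{k_0}$ are much bigger
than $\sigma^{(u)}_{m_0+1}$, $\sigma^{(q)}_{n_0+1}$ and
$\sigma^{(p)}_{k_0+1}$ respectively, the energy lost is not
significant and is bounded as in Equation 7. Based on this
property, we use the eigenvalues in the three matrix unfolding
SVD problems, i.e., the smallest eigenvalues are discarded,
thus reducing the dimensionality of the core tensor to
$\lambda \cdot (m \times n \times k)$. In
this paper, $\lambda$ is tuned empirically.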
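
One possible reading of this selection rule is sketched below: each mode keeps a fraction $\lambda$ of its dimension, and the right-hand side of Equation (7) is accumulated from the discarded singular values. The rounding rule (`ceil`, with at least one dimension kept), the function name `truncate`, and the random test tensor are assumptions for illustration; the paper only says that $\lambda$ is tuned empirically.

```python
import math
import numpy as np

def truncate(A, lam):
    """Keep a fraction `lam` of each mode; return (A_hat, core_dims, bound)."""
    projectors, dims, bound = [], [], 0.0
    for mode, size in enumerate(A.shape):
        unfolding = np.moveaxis(A, mode, 0).reshape(size, -1)
        U, s, _ = np.linalg.svd(unfolding, full_matrices=False)
        keep = max(1, math.ceil(lam * size))      # assumed rounding rule
        W = U[:, :keep]
        projectors.append(W @ W.T)
        dims.append(keep)
        bound += float(np.sum(s[keep:] ** 2))     # right-hand side of Equation (7)
    A_hat = np.einsum('ia,jb,kc,abc->ijk', *projectors, A)
    return A_hat, tuple(dims), bound

rng = np.random.default_rng(2)
A = (rng.random((6, 5, 4)) < 0.15).astype(float)  # sparse illustrative tensor
A_hat, dims, bound = truncate(A, lam=0.5)
error = float(np.sum((A - A_hat) ** 2))           # squared Frobenius norm
```

Comparing `error` with `bound` for several values of `lam` gives a quick check of how much energy a candidate core size discards.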


4.4 Weighting Policy 


In our CubeSVD algorithm, the tensor value measures
the preference of a (user, query) pair on a Web page. If the
page click frequency is used as the tensor value, the algorithm
is biased towards tensor elements with high frequency.
We also try three other weighting approaches:

1) The first is a Boolean model. That is, for each
(user, query) pair, if a page is clicked on, then the tensor
value associated with the three objects is 1, otherwise 0.

2) The second is a re-weighting of click frequency. We
use a method used in the IR community. For each clickthrough
data triple (u, q, p), the weight of the corresponding tensor
value is a re-weighting of the page click frequency $f$:

$$f' = \log_2(1 + f) \qquad (8)$$

The log function is used for scaling the page click frequency
in order to reduce the impact of highly frequent visits.


3) The third approach is similar to the second one. Here
we take into account the Inverse Document Frequency (IDF)
of a Web page (that is, frequency of a page visited by
different users). The intuition is that, if a Web page is visited
by most users, then it is not representative for measuring
users’ interests:

$$f' = \log_2(1 + f / f_0) \qquad (9)$$

In Equation 9, $f_0$ denotes IDF of a Web page.

The above three weighting schemes (denoted by
Weight_Boolean, Weight_Log_Freq, Weight_Log_Freq_IDF
respectively), as well as the scheme without weighting (denoted
by Weight_Freq), are all tested in our experiments in Section 5.
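
The four weighting schemes can be written compactly as array operations. In this sketch, `freq` is assumed to be a (users × queries × pages) array of click frequencies, and $f_0$ is computed as the number of distinct users who clicked a page, which is my reading of "frequency of a page visited by different users". The function name `weight` is hypothetical.

```python
import numpy as np

def weight(freq, scheme):
    """Map click frequencies f to tensor values under one of the four schemes."""
    freq = np.asarray(freq, dtype=float)
    if scheme == "Weight_Freq":
        return freq
    if scheme == "Weight_Boolean":
        return (freq > 0).astype(float)
    if scheme == "Weight_Log_Freq":
        return np.log2(1.0 + freq)                     # Equation (8)
    if scheme == "Weight_Log_Freq_IDF":
        # f_0: number of distinct users per page (assumed interpretation).
        f0 = (freq.sum(axis=1) > 0).sum(axis=0)
        f0 = np.maximum(f0, 1)                         # avoid 0/0 for unvisited pages
        return np.log2(1.0 + freq / f0)                # Equation (9), broadcast over pages
    raise ValueError(f"unknown scheme: {scheme}")

freq = np.zeros((2, 2, 2))
freq[0, 0, 0], freq[1, 0, 0], freq[1, 1, 1] = 3, 1, 7  # illustrative counts
log_freq = weight(freq, "Weight_Log_Freq")
log_freq_idf = weight(freq, "Weight_Log_Freq_IDF")
```

Here page 0 is clicked by two users and page 1 by one, so the IDF variant halves the frequencies of page 0 before the logarithm is applied.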


4.5 Smoothing Scheme 


In the 2-dimensional case, LSI uses the co-occurrence of 
words and documents to capture the latent semantics of a 
document set: if two words co-occur frequently, they may be 
semantically related. In the 3-dimensional case, our Cube- 
SVD algorithm is applied on the clickthrough data, which 
contains the co-occurrence of the three types of objects: 
user, query and Web page. If the link relations among 
them are scarce, the latent associations may be difficult to 
capture. Generally, when a user issues a query, she may 
only visit a very small set of pages of interest, which may 
lead to a highly sparse tensor. In this work, we employ two 
smoothing methods to address the sparseness problem and 
the corresponding results are compared with the one with- 
out smoothing. 


4.5.1 Constant Based Smoothing 


For pages that a user query pair (u, q) does not visit,
the corresponding tensor value is zero. An intuitive and
straightforward smoothing method is to replace the zero
tensor elements with a small constant c (0 < c < 1). That is,
even if a page p is not visited by (u, q) according to the
clickthrough data, it is assumed that page p is in general
visited by u with a small probability if u issues query q.


4.5.2 Page Similarity Based Smoothing 


The second smoothing method is based on content
similarities between Web pages. For each user query pair (u, q),
a set of pages $S_1$ are visited. For each page $p \in S_2$
($S_2$ denotes pages not visited by (u, q)), an overall
similarity between $p$ and the pages in $S_1$ can be calculated
and used to replace the corresponding tensor elements:

$$sim(p, S_1) = \frac{1}{|S_1|} \sum_{a \in S_1} s(p, a), \quad p \in S_2 \qquad (10)$$

In Equation 10, $s(p, a)$ measures the similarity between pages
$p$ and $a$. Here, each page is represented by a vector of word
weights and the similarity between two pages is measured by
the cosine of the angle between the corresponding vectors:

$$s(p, a) = \frac{\sum_j w_{p_j} w_{a_j}}{\|w_p\| \cdot \|w_a\|} \qquad (11)$$

where $w_{p_j}$ denotes the weight of term $j$ in page $p$.
The two smoothing techniques, as well as no smoothing,
are denoted by Smooth_Constant, Smooth_Content and
Smooth_None respectively.
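
Both smoothing methods can be sketched as follows. `page_vectors` is assumed to be a (pages × terms) array of word weights, the function names are hypothetical, and (user, query) pairs without any click are left untouched by the content-based method because Equation (10) is undefined when $S_1$ is empty; the paper does not say how that case is handled.

```python
import numpy as np

def smooth_constant(A, c=0.01):
    """Smooth_Constant: replace zero entries with a small constant 0 < c < 1."""
    return np.where(A == 0, c, A)

def smooth_content(A, page_vectors):
    """Smooth_Content: fill unvisited pages with their mean cosine similarity
    to the pages visited by the same (user, query) pair (Equations 10 and 11)."""
    norms = np.linalg.norm(page_vectors, axis=1, keepdims=True)
    unit = page_vectors / np.where(norms == 0, 1.0, norms)
    cosine = unit @ unit.T                            # s(p, a) for all page pairs
    out = A.astype(float).copy()
    for u, q in zip(*np.nonzero(A.sum(axis=2))):      # pairs with at least one click
        visited = A[u, q] > 0                         # S_1; the complement is S_2
        out[u, q, ~visited] = cosine[~visited][:, visited].mean(axis=1)
    return out

# Illustrative data: three pages over a two-term vocabulary, one (u, q) pair.
page_vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
A = np.zeros((1, 1, 3))
A[0, 0, 0] = 1.0
smoothed = smooth_content(A, page_vectors)
```

In this example the second page shares one term with the visited page and receives a weight of about 0.707, while the third page shares none and stays at zero.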


4.6 Normalization 


For the 2-dimensional case, when LSI is used for information
retrieval, the normalization scheme has a high impact on
the retrieval precision [12]. Since the tensor $\mathcal{A}$ is
of 3 dimensions, it can be normalized from any dimension and the
experiment result may be different. In this work, we
compared all the three normalization methods. For example, if
the tensor is normalized from the user dimension, then for
each user u, all the tensor values corresponding to u are
divided by a constant so that the tensor values sum to 1 after
division, that is:

$$\sum_{1 \le i_q \le n} \sum_{1 \le i_p \le k} a_{i_u i_q i_p} = 1 \qquad (12)$$



Normalization from query or Web page dimension is similar. 
The three normalization methods are denoted by Normal- 
ize_User, Normalize_Query, Normalize_Page respectively. 
More is discussed in Section 5.4.2. 

There is an ordering issue when the techniques discussed
in Sections 4.3-4.6 are combined with the CubeSVD
algorithm. As discussed in Section 4.1, dimension selection is
used in step 4 of the CubeSVD algorithm. Since the weighting,
smoothing and normalization techniques are used to
construct a tensor from the clickthrough data, they are
applied in the first step of CubeSVD. Similar to LSI
applied in IR applications, the order of the three kinds of
techniques used is: weighting, smoothing and normalization.
The weighting technique is first used to assign a value
to the tensor elements associated with the (u,q,p) triples 
which occurred in the clickthrough data. Next, the smooth- 
ing techniques are used to replace some empty elements of 
the tensor. After smoothing is used, normalization is ap- 
plied in order to regard objects of the same type with equal 
importance in the tensor construction. For example, if the 
tensor is normalized from the user dimension, then each user 
is equally important for tensor construction, even though 
the number of queries each user issued or the number of 
pages each user visited may be different. After the weight- 
ing, smoothing and normalization techniques are applied, 
the tensor construction (step 1 in Figure 3) is complete. 
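
The normalization of Equation (12) and the weighting, smoothing, normalization order can be sketched together. The function names, the choice of Weight_Log_Freq with Smooth_Constant and Normalize_User, and the constant `c` are assumptions picked for illustration; any of the schemes from Sections 4.4 to 4.6 could be substituted.

```python
import numpy as np

def normalize(A, axis):
    """Scale every slice along `axis` so that its entries sum to 1 (Equation 12).

    axis 0 is Normalize_User, 1 is Normalize_Query, 2 is Normalize_Page.
    """
    other = tuple(i for i in range(A.ndim) if i != axis)
    totals = A.sum(axis=other, keepdims=True)
    return A / np.where(totals == 0, 1.0, totals)     # all-zero slices stay zero

def build_tensor(freq, c=0.01):
    """Step 1 of Figure 3: weighting, then smoothing, then normalization."""
    weighted = np.log2(1.0 + np.asarray(freq, dtype=float))   # Weight_Log_Freq
    smoothed = np.where(weighted == 0, c, weighted)           # Smooth_Constant
    return normalize(smoothed, axis=0)                        # Normalize_User

freq = np.array([[[3, 0], [0, 0]],
                 [[1, 0], [0, 7]]])                   # illustrative click counts
tensor = build_tensor(freq)
per_user_totals = tensor.sum(axis=(1, 2))
```

After this construction each user contributes the same total mass to the tensor, regardless of how many queries or clicks that user produced.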


5. EXPERIMENTS 


In this section, we introduce the experimental data set, 
our evaluation metrics, and the experiment results. 


5.1 Data Set 


A set of MSN clickthrough data was collected as our
experimental data set. This data set contains about 44.7 million
records of 29 days from Dec 6 of 2003 to Jan 3 of 2004.
As we collected the clickthrough data, we crawled all Web
pages of the ODP (http://dmoz.org/) directory (about 1.3
million). The clickthrough data was split into two parts: a
training set and a test set. The former comprises the first
two weeks of data collection. The rest of the data is used
for testing. For the training data, items with the same
user, query and Web page are grouped into one entry and
the frequency is summed up. We also remove the Web pages
which occurred in the clickthrough data but were not crawled by
our crawler. After this processing step, the training data
contains 19,644,518 entries having 3,676,296 users, 248,149
pages and 996,090 queries. That is, among the 1.3 million
ODP Web pages, 248,149 of them are clicked by Web users in
the first 2 weeks. Each user is identified by their IP address.


Figure 6: Performance of CubeSVD as the dimensions of the core tensor vary. For the leftmost figure, 
the user dimension is fixed at 115 and the other two dimensions change. For the middle figure, the query 
dimension is fixed at 144. For the rightmost figure, the page dimension is fixed at 112. 


This is sometimes inappropriate, for example when multiple
users share one IP address or when a user accesses the Web
through dynamic IPs. In other words, the searches attributed
to one user may actually be conducted by a group of users.
From the training data set, we randomly select 500 users’
clickthrough data and apply our CubeSVD algorithm to it.
Noise is reduced by removing the Web pages that were visited
no more than 3 times and the users who visited no more than
3 pages. We then use these users’ clickthrough data from the
test set to evaluate the search performance. In this work,
we do not handle the new queries and new Web pages contained
in the test set. The SVDPACKC/las1 software package is used
for SVD computation [5].
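
The two preprocessing steps of this section (merging duplicate
records and removing low-activity pages and users) can be
sketched as follows. The function names and the `threshold`
parameter are hypothetical, and the sketch assumes a single
filtering pass computed on the unfiltered counts; the paper does
not say whether the filter is applied once or repeatedly.

```python
from collections import Counter

def aggregate(records):
    """Group identical (user, query, page) records and sum their frequency."""
    return Counter(records)

def filter_noise(counts, threshold=3):
    """Drop pages visited no more than `threshold` times and users who
    visited no more than `threshold` distinct pages (single pass)."""
    page_visits = Counter()
    user_pages = {}
    for (user, _query, page), freq in counts.items():
        page_visits[page] += freq
        user_pages.setdefault(user, set()).add(page)

    return {
        (user, query, page): freq
        for (user, query, page), freq in counts.items()
        if page_visits[page] > threshold and len(user_pages[user]) > threshold
    }
```

Because removing a page can push a user below the threshold (and
vice versa), a stricter variant would repeat `filter_noise` until
the set of entries stops changing.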


5.2 Baseline Algorithms 


For comparison purposes, we also investigate whether the
3-order associations can be captured by 2-dimensional SVD
approaches. We apply LSI to the (user, query)-by-page matrix
and use the reduced-rank approximation of the original
matrix for Web page prediction [22]. In addition, we use a
Collaborative Filtering (CF) algorithm in the experiments.
For CF, we apply the memory-based algorithm with the vector
similarity measure to form neighbors (refer to Equations (1)
and (3) in [6]).
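
The LSI baseline can be illustrated with a short sketch of the
reduced-rank approximation. It assumes the rows of the matrix
are (user, query) pairs and the columns are pages, uses NumPy's
dense SVD in place of SVDPACKC, and the name `lsi_scores` is
hypothetical. The CF baseline is not covered here.

```python
import numpy as np

def lsi_scores(matrix, k):
    """Rank-k approximation of a (user, query)-by-page matrix.

    Each row of the result holds predicted scores for one
    (user, query) pair; pages can be ranked by sorting a row.
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]
```

Pages for a given pair can then be ordered with
`np.argsort(-scores[row])`, where `scores` is the returned matrix.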


5.3 Evaluation Measurements 
We evaluate the Web search accuracy of the different
algorithms using the rank scoring metric [6]. The expected
utility of a ranked list of items is defined as

$$R_s = \sum_j \frac{\delta(s, j)}{2^{(j-1)/(\alpha-1)}} \qquad (13)$$

where j is the rank of a Web page in the recommended list,
$\delta(s, j)$ is 1 if the (user, query) pair s accessed page j
in the test set and 0 otherwise, and $\alpha$ is set to 5 as the
authors of [6] did. The final score reflects the utilities of
all (user, query) pairs in the test set:

$$R = 100 \, \frac{\sum_s R_s}{\sum_s R_s^{\max}} \qquad (14)$$

where $R_s^{\max}$ is the maximum possible utility obtained when
all pages that each (user, query) pair has accessed appear
at the top of the ranked list.
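
As an illustration of Equations (13) and (14), the sketch below
computes the rank score for a set of (user, query) pairs. The
function names and the dictionary-based data layout are
assumptions made for the example; only the formulas come from
the text.

```python
def pair_utility(ranked_pages, clicked, alpha=5.0):
    """Equation (13): utility of one ranked list for one (user, query) pair."""
    return sum(
        1.0 / 2 ** ((rank - 1) / (alpha - 1))
        for rank, page in enumerate(ranked_pages, start=1)
        if page in clicked
    )

def max_utility(clicked, alpha=5.0):
    """Utility when all clicked pages occupy the top ranks."""
    return sum(
        1.0 / 2 ** ((rank - 1) / (alpha - 1))
        for rank in range(1, len(clicked) + 1)
    )

def rank_score(rankings, clicks, alpha=5.0):
    """Equation (14): overall score over all (user, query) pairs.

    rankings: dict mapping a pair to its ranked list of pages
    clicks:   dict mapping a pair to the set of pages clicked in the test set
    """
    total = sum(pair_utility(rankings[s], clicks[s], alpha) for s in clicks)
    best = sum(max_utility(clicks[s], alpha) for s in clicks)
    return 100.0 * total / best
```

With alpha = 5, a clicked page at rank 5 contributes half as much
as one at rank 1, so the score rewards placing clicked pages near
the top of the list.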
5.4 Experimental Results


We implemented all 4 weighting methods, 3 smoothing
schemes and 3 normalization methods discussed in Section 4,
which leads to 36 different settings. We evaluated CubeSVD
with all of these settings and also compared it with CF and
LSI.


5.4.1 Influence of the Core Tensor Dimensions 


We first conduct experiments to study the influence of the
core tensor dimensions on the performance of our CubeSVD
algorithm. When we apply CubeSVD to tensors constructed with
different weighting, smoothing and normalization methods,
all the results show that the search accuracy depends
heavily on the dimensions of the core tensor. For example,
when we use Boolean weighting and normalization from the
query dimension without smoothing, we get a 500 x 168 x 182
(u x q x p) tensor. After SVD is performed, the dimensions
associated with the three matrix unfoldings are 235, 157 and
182 respectively. The CubeSVD algorithm achieves its best
accuracy (utility 69.62) when the core tensor dimensions are
115, 144 and 112 respectively. If one dimension of the core
tensor is fixed, the search accuracy varies as the other two
dimensions change, as illustrated in Figure 6: the vertical
axis denotes the utility measure and the other two axes
denote the two dimensions that are varied. Each dimension
increases in steps of 0.1 x the corresponding highest
dimension and is expressed as a fraction of that highest
dimension.

We also employed our eigenvalue based method to determine
the dimensions of the core tensor. Its parameter is varied
from 0.1 to 1 in steps of 0.1. For this experiment, when the
parameter is 0.9, we get a 211 x 141 x 163 core tensor and
the utility achieved is 68.6, which is close to the optimal
result (utility 69.62).


5.4.2 Influence of Weighting, Smoothing and Nor- 
malization Methods 


According to our experimental results, normalization from
the query dimension is slightly better than normalization
from the user or page dimension. This conclusion holds even
when different weighting or smoothing techniques are used.
Figure 7 gives a group of experimental results that
correspond to normalization from the query dimension,
obtained with different weighting and smoothing methods.


Figure 7: Search results of the CubeSVD algorithm
normalized from the query dimension, associated with
different weighting policies and smoothing schemes
(Smooth_None, Smooth_Constant, Smooth_Content).


We can see that the weighting policy influences the search
results, especially when the log frequency weighting method
is used. The Boolean model performs worst of the four
weighting methods. Contrary to our expectation, the
Weight_Log_Freq_IDF weighting method is not as good as the
Weight_Log_Freq method, and is sometimes even worse than
using the raw frequency without a weighting scheme
(Weight_Freq). From Figure 7, we can also see that smoothing
improves the search accuracy. Even the constant based
smoothing method (c = 0.05 in this experiment) outperforms
the setting without smoothing. The page similarity based
smoothing approach is better than constant based smoothing.


5.4.3 Comparison with Other Approaches 


We also conduct experiments to compare CubeSVD with LSI
and CF. In all the settings, CubeSVD outperforms both LSI
and CF. Figure 8 shows the results of the three algorithms
with page similarity based smoothing and normalization from
the query dimension. Results associated with the 4 weighting
methods are plotted. For LSI, the reduced dimension varies
from 1 to the highest possible dimension (the matrix rank)
and the best result is reported. For CF, we vary the number
of neighbors and report the best result. According to the
results, CubeSVD significantly outperforms both baseline
algorithms.


5.4.4 Discussions 


From the experiments, we observe that CubeSVD achieves
better search accuracy than CF and LSI. The reason is that
CubeSVD can exploit the clickthrough data to capture the
latent associations among the multi-type objects. Such
high-order associations cannot be well captured by CF or LSI
applied to 2-dimensional matrix data.

We also find that the core tensor dimensionality is crucial
to the performance of CubeSVD. Different weighting,
smoothing and normalization methods also affect the search
accuracy. According to the experimental results, the
Weight_Log_Freq approach is the best weighting method. When
Inverse Document Frequency is used, the search result does
not improve. In our opinion, the reason is that there are
not many pages that are frequently visited by users with
different interests. Therefore, when IDF is used for
weighting, the search accuracy even decreases.


Figure 8: Search results of CF, LSI and CubeSVD for the
Boolean, Freq, Log_Freq and Log_Freq_IDF weighting methods.


Smoothing techniques can improve the search result. Since
the page content information is used, the page similarity
based smoothing is better than constant based smoothing.
The effect of similarity based smoothing for sparse data is
also observed in [3].

Analyzing the CubeSVD algorithm illustrated in Figure 3, we
find that most of the time is consumed by steps 3-5. In
step 3, SVD is performed on the three unfolded matrices. If
the tensor is large, this step is quite time-consuming. In
particular, if smoothing is used, the original sparse tensor
becomes relatively dense and the scale of the SVD problem
increases. If no smoothing is used, the unfolded matrices
contain many zero columns, which decreases the scale of the
SVD problem. Even though the large-scale CubeSVD algorithm
is quite time-consuming, the computation can be performed
offline beforehand. After the CubeSVD analysis, the results
can be used to help search Web pages in real time, because
the preferences of each (user, query) pair on Web pages have
been computed in advance. Thus the search results can be
adapted to users according to the associations among Web
pages, users and queries submitted.
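
To make the cost discussion concrete, the sketch below shows a
generic truncated higher-order SVD: one SVD per matrix
unfolding, a core tensor, and a reconstructed tensor whose
entries can be read as (user, query, page) preference scores. It
is an assumption-laden illustration, not the authors' code: it
uses NumPy's dense SVD instead of SVDPACKC, the names `unfold`,
`mode_product` and `truncated_hosvd` are hypothetical, and no
claim is made about how its parts map onto the numbered steps of
Figure 3.

```python
import numpy as np

def unfold(tensor, mode):
    """Matrix unfolding: rows index `mode`, columns index all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` (J x I_mode) along the given mode."""
    moved = np.moveaxis(tensor, mode, 0)
    result = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(result, 0, mode)

def truncated_hosvd(tensor, dims):
    """Return (core, factors, approximation) for core dimensions `dims`."""
    # One SVD per unfolding; keep the leading left singular vectors.
    factors = []
    for mode, k in enumerate(dims):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :k])

    # Core tensor: project the data onto the retained subspaces.
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)

    # Reconstruction: expand the core back to the original shape.
    approx = core
    for mode, u in enumerate(factors):
        approx = mode_product(approx, u, mode)
    return core, factors, approx
```

Once `approx` has been computed offline, ranking pages for a
(user, query) pair at query time reduces to sorting the
corresponding fiber `approx[u, q, :]`.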


6. CONCLUSION AND FUTURE WORK 


Personalized Web search services will play an important
role on the Web. This paper focuses on utilizing
clickthrough data to improve Web search. A novel CubeSVD
approach is proposed to deal with clickthrough data, which
is three-way and highly sparse. We used a real-world data
set to evaluate the CubeSVD algorithm combined with a
variety of techniques, examining the impact of different
weighting, smoothing and normalization methods. The
experimental results indicate that the CubeSVD approach can
significantly improve Web search performance.

There are also many areas for future research: 

1) Our current work considers only the users whose
clickthrough data was recorded, and only the queries issued
and pages clicked on by these users. It would therefore be
interesting to adapt our framework to newly emerging objects
(new users, queries and Web pages). One possible approach is
to combine the CubeSVD technique with a traditional
content-based search model.


2) The offline computation of CubeSVD is quite time- 
consuming, especially when the clickthrough data contains a 
large number of objects. With CubeSVD as a base approach, 
we will seek ways to improve its efficiency. 

3) We also plan to conduct more research on how to auto- 
matically determine the optimal dimensionality of the core 
tensor. 

4) The CubeSVD framework proposed in this paper is 
not limited to Web search but is general enough and can 
be applied to other applications where three-way relations 
exist. 


7. ACKNOWLEDGMENTS 


We thank Xue-Mei Jiang and Ya-Bin Kang for their help 
in preparing the data used in this work. We also express 
thanks to Xuan-Hui Wang for his comments on this paper 
and helpful discussions. 


REFERENCES 

[1] Google personalized search.
http://labs.google.com/personalized.

[2] My yahoo! http://my.yahoo.com/?myhome.

[3] P. Alexandrin, U. Lyle, P. David, and L. Steve.
Probabilistic models for unified collaborative and
content-based recommendation in sparse-data
environments. In Proceedings of the 17th Annual
Conference on Uncertainty in Artificial Intelligence
(UAI-01), pages 437-444, San Francisco, CA, 2001.
Morgan Kaufmann Publishers.

[4] R. B. Almeida and V. A. F. Almeida. A
community-aware search engine. In Proceedings of the
13th International Conference on World Wide Web,
pages 413-421. ACM Press, 2004.

[5] M. Berry, T. Do, and S. Varadhan. Svdpackc (version
1.0) user’s guide. Technical Report CS-93-194,
University of Tennessee, 1993.

[6] J. S. Breese, D. Heckerman, and C. Kadie. Empirical
analysis of predictive algorithms for collaborative
filtering. In Proceedings of the Fourteenth Annual
Conference on Uncertainty in Artificial Intelligence,
pages 43-52. Morgan Kaufmann, 1998.

[7] S. C. Deerwester, S. T. Dumais, T. K. Landauer,
G. W. Furnas, and R. A. Harshman. Indexing by
latent semantic analysis. Journal of the American
Society of Information Science, 41(6):391-407, 1990.

[8] L. Fitzpatrick and M. Dent. Automatic feedback using
past queries: social searching? In Proceedings of the
20th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval,
pages 306-313. ACM Press, 1997.

[9] G. Golub and C. V. Loan. Matrix Computations, 2nd
edition. The Johns Hopkins University Press,
Baltimore, Maryland, 1989.

[10] T. H. Haveliwala. Topic-sensitive pagerank. In
Proceedings of the 11th International Conference on
World Wide Web, pages 517-526. ACM Press, 2002.

[11] J. L. Herlocker, J. A. Konstan, A. Borchers, and
J. Riedl. An algorithmic framework for performing
collaborative filtering. In Proceedings of the 22nd
Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval,
pages 230-237. ACM Press, 1999.

[12] P. Husbands, H. Simon, and C. H. Q. Ding. On the
use of the singular value decomposition for text
retrieval. Computational Information Retrieval, pages
145-156, 2001.

[13] X. Jin, Y. Zhou, and B. Mobasher. Web usage mining
based on probabilistic latent semantic analysis. In
Proceedings of the 2004 ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining,
pages 197-205. ACM Press, 2004.

[14] T. Joachims. Optimizing search engines using
clickthrough data. In Proceedings of the 8th ACM
SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 133-142. ACM
Press, 2002.

[15] L. D. Lathauwer, B. D. Moor, and J. Vandewalle. A
multilinear singular value decomposition. SIAM
Journal on Matrix Analysis and Applications,
21(4):1253-1278, 2000.

[16] F. Liu, C. Yu, and W. Meng. Personalized web search
by mapping user queries to categories. In Proceedings
of the 11th International Conference on Information
and Knowledge Management, pages 558-565. ACM
Press, 2002.

[17] B. Mobasher, H. Dai, M. Nakagawa, and T. Luo.
Discovery and evaluation of aggregate usage profiles
for web personalization. Data Mining and Knowledge
Discovery, 6(1):61-82, 2002.

[18] L. Page, S. Brin, R. Motwani, and T. Winograd. The
pagerank citation ranking: Bringing order to the web.
Technical report, Stanford Digital Library
Technologies Project, 1998.

[19] J. Pitkow, H. Schutze, T. Cass, R. Cooley,
D. Turnbull, A. Edmonds, E. Adar, and T. Breuel.
Personalized search. Communications of the ACM,
45(9):50-55, 2002.

[20] M. H. Pryor. The effects of singular value
decomposition on collaborative filtering. Technical
Report PCS-TR98-338, Dartmouth College, Computer
Science, Hanover, NH, June 1998.

[21] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl.
Application of dimensionality reduction in
recommender systems-a case study, 2000.

[22] N. Srebro and T. Jaakkola. Weighted low-rank
approximations. In Proceedings of the 12th
International Conference on Machine Learning, pages
720-727. AAAI Press, 2003.

[23] K. Sugiyama, K. Hatano, and M. Yoshikawa. Adaptive
web search based on user profile constructed without
any effort from users. In Proceedings of the 13th
International Conference on World Wide Web, pages
675-684. ACM Press, 2004.

[24] M. A. O. Vasilescu and D. Terzopoulos. Multilinear
image analysis for facial recognition. In ICPR, pages
511-514, 2002.

[25] J. Wang, H. Zeng, Z. Chen, H. Lu, L. Tao, and W.-Y.
Ma. Recom: reinforcement clustering of multi-type
interrelated data objects. In Proceedings of the 26th
Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval,
pages 274-281. ACM Press, 2003.


